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Abstract - Finding a basis matrix (dictionary) by which objective signals are represented sparsely 
is of major relevance in various scientific and technological fields. We consider a problem to learn 
a dictionary from a set of training signals. We employ techniques of statistical mechanics of 
disordered systems to evaluate the size of the training set necessary to typically succeed in the 
dictionary learning. The results indicate that the necessary size is much smaller than previously 
estimated, which theoretically supports and/or encourages the use of dictionary learning in prac- 
tical situations. 



Introduction. — In various fields of science and tech- 
nology, such as earth observation, astronomy, medicine, 
civil engineering, materials science, and in compiling im- 
age databases [I], it has a major relevance to recover origi- 
nal signals from deficient signals obtained by limited num- 
ber of measurements. The Nyquist- Shannon sampling the- 
orem |2] provides the necessary and sufficient number of 
measurements for recovering arbitrary band-limited sig- 
nals. However, techniques based on this theorem some- 
times do not match restrictions and/or demands of today's 
front-line applications [2111], and much effort is still being 
made to find more efficient methodologies. 

The concept of sparse representations has recently 
drawn great attention in such research. Many real world 
signals such as natural images are represented sparsely in 
Fourier/wavelet domains; namely, many components van- 
ish or are negligibly small in amplitude when the signals 
are represented by Fourier/wavelet bases. This empirical 
property is exploited in the signal recovery paradigm of 
compressed sensing (CS) enabling the recovery of sparse 
signals from much fewer measurements than those the 
sampling theorem estimates [&M10]. 



However, the effectiveness of CS relies considerably on 
the assumption that a basis by which the objective signals 
look sparse is known in advance. Therefore, in applying 
CS to general signals of interest, whose bases for sparse 
representation are unknown, the primary task to accom- 
plish is to identify an appropriate basis (dictionary) for 
the sparse representation from an available set of train- 
ing signals. This is often termed dictionary learning (DL) 



HBD3I. 

Let us denote the training set of M-dimensional signals 
as an M x P matrix Y = {Y^i}, where each column vec- 
tor Yi represents a sample signal and P is the number of 
the samples. In a simple scenario, DL is formulated as a 
problem to find a pair of an M x N matrix (dictionary) 
D = {D^i} and an N X P sparse matrix X = {Xu} such 
that Y = DX holds. By DL, the characteristics/trends 
underlying {Yi} are extracted into D, and Yj can be com- 
pactly represented as a superposition of a few dictionary 
columns, whose combination and strength are specified by 
the sparse matrix X. DL suits not only efficient signal 
processing such as CS, but also extraction of non-trivial 
regularities from high-dimensional data. For instance, DL 
has been successfully applied to the facial image processing 
for the efficient storage of large databases, where standard 
algorithms fail [T2 | [T4]. In this case, Yj and D correspond 
to a facial image and a collection of patches of facial pat- 
terns learned by the P samples, respectively. A variant 
of DL has also been employed in gene expression analy- 
sis to estimate transcription factor activity D from gene 
expression data of a small size {Yi} [15] . 

An important question of DL is how large a sample size 
P is necessary to uniquely identify an appropriate dictio- 
nary D, because the ambiguity of the dictionary is fatal 
in use for signal/data analysis after learning. As the first 
answer to this question, an earlier study based on linear 
algebra showed that when the training set Y is generated 
by a pair of matrices D° and X° (planted solution) as 
Y = D°X°, one can perfectly learn these as a unique 
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solution except for the ambiguities of signs and permu- 
tations of matrix elements if P > P c = (k + l)jvCfc and 
k is sufficiently small, where k is the number of non-zero 
elements in each column of X° |16l . This result is signifi- 
cant as it is the first proof that guarantees the learnability 
with a finite size sample set for DL. However, the estimate 
of P c is supposed to enable a considerable improvement; 
the authors of [16] speculated that P c could be reduced 
substantially to 0(N 2 ) or even smaller, although provid- 
ing a mathematical proof was technically difficult. The 
improvement of the estimation P c is practically significant 
because it will lead to considerable reduction of necessary 
cost for DL in terms of both sample and computational 
complexities. 

In this Letter, we take an alternative approach to esti- 
mating P c . Specifically, we examine the typical behavior of 
DL using the replica method in the limit of N, M, P — >• oo. 
The main result of our analysis is that the planted solution 
is typically learnablc by O(N) training samples if negligi- 
ble mean square errors per element are allowed and M/N 
is sufficiently large. This theoretically supports and/or en- 
courages the employment of DL in practical applications. 

Problem setting. — We focus on the learning strat- 
egy 



min ||Y(= N- 1/2 D°X°) - N- 1/2 DX\ 
n,x 

subj. to ||D|| 2 =MN, \\X\ 



NP6 



(1) 



nn nu ngnsj , where hah 2 = y,^ a Ii for a matrix 

A = {A^i}, and ||^4||o represents the number of non-zero 
elements of A. The parameter 9 £ [0, 1] denotes the rate 
of non-zero elements assumed by the learner, and iV -1 ' 2 
is introduced for convenience in taking the large system 
limit. 

For simplicity, we assume that D° and X° of the 
planted solution are uniformly generated under the con- 
straints of ||£>°|| 2 = MN, \\X°\\ = NPp and ||X°|| 2 = 
NPp. We consider that the correct non-zero density p can 
differ from 9 for generality, but we assume p < 9; other- 
wise, the correct identification of D° and X° is trivially 
impossible. The main goal of our study is to evaluate the 
critical sample ratio j c = P c /N above which the planted 
solution can be learned typically. 



Statistical mechanics approach, 
tion 



Partition func- 



Z f3 (D°,X a )= fdDdXcxp 



2N' 



\DX D°X L 



x S(\\D\\ 2 - NM)6(\\X\\ - NP9) 



(2) 



constitutes the basis of our approach since the minimized 
cost of eq. ([I} can be identified with the zero temperature 
free energy F = - lim^oo 0' 1 \nZ(D°,X°; (3). This sta- 
tistically fluctuates depending on D° and X°. However, 
as N, M, P — > oo, one can expect that the self-averaging 
property is realized; i.e., the free energy density N~ 2 F 




Fig. 1: (color online) (a) and (b) show distributions of local 
field h (a) P(h\X° / 0) for X° ^ and (b) P(h\X° = 0) for 
X° = 0. (c) and (d) show X* and dX* /dh as functions of h, 
respectively. 

converges to the typical value / = N~ 2 [F]o with probabil- 
ity unity, where [• • ■ ]o stands for the average with respect 
to D° and X° . Consequently, this property is also ex- 
pected to hold for other relevant macroscopic variables of 
the solution of eq. ([T]), D* and X* . Therefore, assessing 
/ is the central issue in our analysis. 

This assessment can be carried out systematically using 
the replica method [5D] in the limit of N — > oo while keep- 
ing a = M/N - 0(1) and 7 = P/N - 0(1). Under the 
replica symmetric (RS) ansatz, where the solution space 
of eq. |T]) is assumed to be composed of at most a few pure 
states [21], the free energy density is given as 



f = cxtr < — a ( 



Qd-XdXd ^ __. , XD+m 2 D 



-m D m D - 



2Qi 



f{ o mxmx+X9-{{<f>(h;Qx,\)))h 



UliQx ~ 2m D m x + p) 
2(1 + Q xX d+Xx) 



}■ 



(3) 



where extr fi ^{Q(fl, Q)} stands for the extremization of 
a function G(Q, fi) with respect to a set of macroscopic 
variables O = {xd, mc Qx, XXi^ix} and that of their 
conjugates ft = {Q D ,XD,m D ,Q x ,Xx,mx,^}, and 



4>(h;Q x ,X) 



x 



lim 



QxX- 



-hX 



+ A|AT}. (4) 



Notation ((■ ■ -))h represents the average with respect to h 
according to the distribution P(h) = pP(h\X° ^ 0) + (1 - 
p)P(h\X° = 0), where P(h\X° ^ 0) and P(h\X° = 0) are 
given by zero-mean Gaussian distributions with variances 
Xx+rn 2 x and xx, respectively (Fig.[T][a),(b)). The details 
of the derivation of the free energy density are shown in 
Appendix. 
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Physical implications. 

cq. (O, the relationships 



At the extremum of 



m D 
m x 



1 



MN 
1 



[Tr(D°) l D*} , 



p [Tr(X°) T X*] , 
Qx = T^[Tr(X*) T X*] = 



NP 1 



NP' 



IX* 



(5) 
(6) 

(T) 



hold, where T denotes the matrix transpose. These pro- 
vide the mean square errors (per element), which measure 
the performance of DL, as 



(~D 



1 



ID* -D 



II 2" 



= 2(1 - m D ) 



(8) 



MN" 
e x = {NP)- 1 [\\X* - X°\\ 2 h =P- 2mx + Qx- (9) 

The variables \d and xx physically mean the sensitivity 
of the estimates D* and X* when the cost of eq. ([1]) is 
linearly perturbed. 

Eq. @ represents the effective single-body minimization 
problem concerning an element of X that is statistically 
equivalent to eq. (UJ [35]. Here, the randomness of D° 
and X° is effectively replaced by the random local held 
h. The first and second terms of P{h) correspond to the 
cases where an element of X° is given as 1° ^ and 
X° = 0, respectively. Under a given h, the solution X* 
that minimizes the cost of eq. ((4]) is offered as X* = h/Qx 
for \h\ > hth = (2Q X X) 1/2 and otherwise (Fig. Q] (c)). 
We refer to the cases of \h\ > hth and \h\ < hth as "active" 
and "inactive," respectively. When X° ^ 0, h is generated 
from a Gaussian distribution (P(h\X° ^ 0)) of zero-mean 



and variance xx 



l Xi 



and X* is more likely to be active 



than when X° = 0, for which h is characterized by another 
zero-mean Gaussian (P(h\X° = 0)) of a smaller variance 
Xx (Fig. Q] (a),(b)). Therefore, one can expect that the 
hard-thresholding scheme based on h t h represents proper 
assignment of zero/non-zero elements in X* so as to ac- 
curately estimate X° and D° if rhx is sufficiently large. 

A distinctive feature of X* is the divergence of the 
local susceptibility dX*/dh at "border" cases of h = 
±/i t h (Fig. Q] (d)). This affects the increase in the ef- 
fective degree of freedom (ratio) as follows: 6 c g = 8 + 
((hthS(\h\ — h t h)))h, whereas hth is determined so as to 
satisfy 8 = Ji h i >h dhP(h) indicating the sparsity con- 
dition ||X|| = NP8. The excess ((h th S(\h\ - h th ))) h is 
supposed to represent a combinatorial complexity for clas- 
sifying each element of X* that corresponds to the border 
case \h\ = h t h into the active case, \h\ > h t h and X* 7^ 0, 
or the inactive case, \h\ < h t h and X = 0. The divergence 
of dX* /dh\h=±h th is also accompanied by the instability 
of the RS solution against perturbations that break the 
replica symmetry |23| . The influence of this instability is 
discussed later. 

Actual solutions. — We found two types of solutions; 
the first one is characterized by mp = 1 and Qx = mx = 
p, while the second is characterized by mx = and mr> = 



0.15 




0.05 



7 1.6 



Fig. 2: (color online) 7-dependence of the ratio of marginal 
modes relative to iV 2 at a = 0.5 and p = 6 — 0.1. The be- 
havior at N — > oc extrapolating from the results of finite N 
is denoted by the dashed line. Inset: iV-dependence of the ra- 
tio of marginal modes for 7 = 2. The dashed line stands for 
N^ 1 as a guide. Each marker represents the average of 100 
experiments. 



0. The former case provides e_o = ex = indicating the 
correct identification of D° and X°, and hence we call it 
the success solution. The latter is referred to as the failure 
solution since mu = and mx — indicate the complete 
failure of information extraction of D° and X°. 
Success solution (S) exists when 7 > 1 and 

a > 8 s cif (8,p) =8 + (l- p)^/~uexp (-^j (10) 

hold, where u = R- l {{8 - p)/(2(l - p))) and H^ix) is 
the inverse function of H{x) = {2n)- 1 / 2 f™ dte^ 2 / 2 . S 
is further classified into two cases depending on 7. For 
7 > 7s, where 



7s(a,6»,p) 



0L 



(ii) 



Xd and xx are finite. On the other hand, for 1 < 7 < 73, 
Xd and xx tend to infinity, keeping R = xd/xx finite. 

To physically interpret this classification, let us take a 
variation around Y = 7V~ 1//2 D°X°, which yields 



(12) 



= 5(DX)\ D o iX o = D°SX + 5DX". 



If SD — and SX = are the unique solution of cq. (Q2 
the planted solution is locally stable. Otherwise, there are 
"marginal" modes along which the cost of eq. (HJ does not 
increase locally, and the solution set forms a manifold. The 
number of constraints of eq. (|T2"T) , MP, coincides with that 
of the degree of freedom of SD and SX, MN + NP8 s eS , 
at P = 7sA^. Thus, the classification below/above 7s cor- 
responds to the change in the number of marginal modes 
around the planted solution. 
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Fig. 3: (color online) Schematic pictures of 7-dependence of 
the phase space and free energy under RS assumption. 



To confirm the validity of this interpretation, we numer- 
ically evaluated the number of marginal modes of eq. (|12jl 
in the case of a = 1/2 and B = p = 0.1, which is shown in 
Fig. [2] The assessment of 7 S when 9 = p is conjectured to 
be exact since the effect of the border elements is negligible 
under this condition. Fig. [2] indicates that the number of 
marginal modes scales as 0(N 2 ) for 7 < 7s = 1.25, while 
it scales as O(N), and the contribution of the marginal 
modes approaches zero, for 7 > 7s (inset). This result co- 
incides with our theoretical assessment. At the same time, 
this implies that identifying the planted solution without 
any errors by eq. (JTJ is difficult as long as 7 ~ 0(1), 
but the discrepancies per element caused by the marginal 
modes are negligibly small and could be allowed in many 
practical situations. 

In the case of 7 < 1, for any Nx P matrix Z of ||^||o = 
NP6, X* = aZ and D* = a,- 1 Y(ZZ' T )- 1 Z T minimize 
the cost of fl} to zero, where a is determined such that 
||D*|| 2 = MN. This implies that the set of solutions 
of eq. ([1} spreads widely, and the weight of the planted 
solution is negligibly small in the state space. This may 
be why S disappears for 7 < 1. 

Failure solution (F) exists for V7 > 0. If 



a < e F ff (#) = 9 + 



2 1 tr 

vcxp 



(13) 



where v = H~ 1 (9/2) holds, F always offers Xd,Xx — > 00 
making the free energy / vanish. For a > 9^ s , on the other 
hand, xd and Xx become finite implying that a single 
solution of eq. (QJ is locally stable for most directions and 
offers f > 0, if 7 is greater than 



If (a, 9) 



(" 1/2 -&) 1/2 ) 



(14) 



The inequality 6 F S > 9^ s always holds because the influ- 
ence of the border elements for F is stronger than that for 
S, which leads to 7s < 7f- 

Fig. [3] illustrates changes in state space that occur for 
sufficiently large a under the RS assumption. For 7 < 1, 



Learnable by 
O(N) sample 




Fig. 4: (color online) Phase diagram on a — 9 plane. 



F is a unique solution. As 7 increases, S appears at 7 = 1, 
and the number of marginal modes changes from 0(N 2 ) 
to O(N) at 7 = 7s. This implies that when negligibly 
small linear perturbations are added to the cost of eq. ([T]), 
the limits limAr^oo ejj ~ and limjv^oo ex ~ still hold 
for S of 7 > 7s while they can be boosted to O(l) for S 
of 7 < 7 S . For 7 < (73 <)tp, S and F are degenerated 
providing / = 0. However, at 7 = yp, S becomes thermo- 
dynamically dominant by keeping / — 0, while F begins 
to have positive /. This means that the planted solution 
is typically learnable by P > P c = Ny-p ~ O(N) training 
samples if negligible mean square errors per element are 
allowed. 

Fig. |4] plots the phase diagram on an a — 9 plane. The 
region above a = 9^(8) (curve) represents the condition 
under which the planted solution is typically learnable 
by O(N) training samples. DL is impossible in the re- 
gion below a — 9 (straight line) because X° cannot be 
correctly recovered even if D° is known [7j. How the 
sample complexity scales with respect to N in the region 
9 < a < 9 F S (6) is beyond the scope of this Letter, but an 
interesting question nonetheless. 



Summary and discussion. — In summary, we have 
assessed the size of training samples required for cor- 
rectly learning a planted solution in DL using the replica 
method. Our analysis indicated that O(N) samples, which 
are much fewer than estimated in an earlier study |16j , are 
sufficient for learning a planted dictionary with allowance 
for negligible square discrepancies per element when the 
number of non-zero signals is sufficiently small compared 
to that of measurements. 

It was shown that the identification of dictionary can 
be characterized as a phase transition with respect to the 
number of training samples. Our RS analysis probably 
does not describe the exact behavior of DL since the RS 
solutions are unstable against the replica symmetry break- 
ing (RSB) disturbances. However, we still speculate that 
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the RS estimate of 7f serves as an upper bound of the cor- 
rect critical ratio 7 C . This is because the free energy value 
of F assessed under the RSB ansatz should be greater than 
or equal to that of the RS solution due to the positivity 
constraint of the entropy of pure states (complexity) [53] , 
whereas that of S is kept to vanish, which always yields a 
smaller estimate of 7f- 

Promising future research includes an extension of the 
current framework to more general situations such as noisy 
cases as well as refinement of the estimates of the critical 
ratios 7s and 7f taking RSB into account |25j . 



This work was partially supported by a Grant-in- Aid 
for JSPS Fellow (No. 23-4665) (AS) and KAKENHI No. 
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Appendix: Derivation of eq. ([3]).— In general, the 
conhgurational average of the free energy density could be 
evaluated on the basis of the following formula: 

h = ~\ lim -r-^lim^-ln[Z£(D ,X°)] . (15) 

Unfortunately, assessing [ZVj(D°,X°)]o for n e R in the 
mathematically rigorous manner is technically difficult, 
and this fact prohibits us from utilizing eq. (|15[) in prac- 
tice. In the replica method, this difficulty is resolved by 
evaluating N~ 2 \n[Zp (D° , X°)} for n e N as an analytic 
function of n first in the limit of N — >• 00, and taking the 
n — >■ limit afterward with use of the obtained analytic 
function for n S M as well. 

More precisely, we evaluate [Za(D°, X°)]q by averaging 
the right hand side of an identity 



and 



Z2(D°,X°) = 



/n 
\ {dD a X a S{\\D a \\ 2 - NM)S{\\X a \\ Q - NPd)} 



x exp 



2N 



J2\\D a X a -D°X 



0||2 



(16) 



a=l 



which is valid for only n <E N, over the distributions of the 
planted solutions D° and X° that are given by 



p d »(d°) = 77 -s(\\dY-nm) 



(17) 



1 = NP J dqfS{Tr(X a ) T X b - NPqf 



(20) 



to the integrand. Let us denote Qd = {q c ^) and Q 
(<Zx ); and introduce two joint distributions 



A" 



P D ({D a }-Q D 



P D o(D°) 



V D (Q D ) 

n 

J S{\\D a \\ 2 - NM) Y[ S(TY(D a ) T D b - NMqg), 

(21) 



a<b 



P x ({X a };Q x ) 



P x o(X°)S(\\X*\\ -NPp) 



Vx(Qx) 

n 

J <5(||X Q || - NP6) Y[ 6(Ti(X a ) T X b 



0=1 



,<i, 



NPqf), 

(22) 

where Vd(Qd) and Vx(Qx) are the normalization con- 
stants. The above-mentioned computation provides the 
following expression: 

[Z$(D°,X°)] 

' d(NMQ D )d(NPQx)V D (Q D )V x (Q x ) 



IWI-fEft 8 



n,i 



Qx,Qd 



and 

P x o(X°) = J] {(1 - p)S(Xi,i) + -J= exp (-?f) J , (18) 

respectively, where Md is the normalization constant. In 
performing the integrals of 2(n + 1) variables (D°, {D a }) 
and (X ,{X a }) that come out in this evaluation, we in- 
sert trivial identities with respect to all combinations of 
replicas (a, b = 0, 1,2, ... , n), 



where t% = A^E^iC ^ ~ D % X il)- Notation 
[• • -]q d q x represents the average with respect to {D a } 
and {X a } within the state space specified by Qd and 
Qx, whose distributions are given by eqs. (f2~Tj) and (|2"2"|) . 
Distributions ([2~Tj) and (|22|) are independent of each other, 
and provide each entry of {D a } and {-X' a } with zero mean 
and a finite variance. This allows us to utilize the central 
limit theorem indicating that we can handle t a , as multi- 
variate Gaussian random variables that follow 

fl «*»-n 7 5^asr , '(-55*^ 1)- *)' 

(23) 

where T stands for annxn matrix whose entries are given 
as T ab = q a £qf ~ {q^lx + QdQx) + P- Utilizing this and 
evaluating integrals of Qx and Qd by means of the saddle 
point method lead to an expression 

lim -^[Z£(£>°,X°)]o = extrf - ^ lndet(Z n + (3T) 

Af->ooJV z 12 

+ 7{^f^ + In (f{fl dX*}P x0 (X°)e-A } 



( TtQdQd 1 A 1 

aj -lndetQ^j 



1 = NM 



J dq a I b S(Tl(D a ) T D b - NMq ab ), 



(19) 



-n\6 



2 

no 



ln(27r) 



(24) 
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Here, T n represents the n x n identity matrix, auxil- 
iary variables Qd = (to?) an d Qx = (Qx) arc intro- 
duced in evaluating Vd(Qt>) and Vx(Qx) with use of 
the saddle point method, and H e i YZ fc=o QxX a X b + 
\J2a=i li m e^+o |X a | e . Extrcmization should be taken 
with respect to A and four kinds of macroscopic variables 
Qd, Qx, Qd, and Q x . 

Exactly evaluating eq. (f2"4")l should provide the correct 
leading order estimate of N~ 2 \n[ZVj(D°,X°)]o for each 
of n £ N. However, we here restrict the candidate of the 
dominant saddle point to that of the replica symmetric 
form as 



ab 

Qd 



ab 

Qx 



~ab 

Id 



ri ab — 

Qx — 



1, a = b 

q D , a^b, (a,b^0) (25) 

m,D, a = 0, b ^ 

Qx, a = b 

qx, a^b, (a, 6^0) (26) 

rax, a = 0, b ^ 

Qd, a = b 

-q D , a^b, (a,b^0) (27) 

—rriD, a = 0, b ^ 

Qx, a = b 

-qx, a^b, (a, b t^O) (28) 

~rhx, a = 0, 6 7^ 



so as to obtain an analytic expression with respect to n. 
This yields 



(29) 



lndet(Z„ + (3T) = nln(l + (3(Q X ~ q D qx)) 
/3(toto - 2m D m x + p) 



In 1 



1 + P(Qx - QDqx) 



TrQxQx 



+ ln( ff[dX a P x o(X° 

\J a=0 



= -zQxQx - nrhxmx - 
+ ln«( fdXe-^j n )) h , 



n(n — 1) 



-qxqx 



(30) 



and 



TiQ D Q D 1 



In det Q 



D 



2 2 

n - n(n — 1) 
-Qd - nm D m D qoQD 



fln(Q B 



1 / 1 qD + m D 

to) - o U - " a -t 

?d + to 



where £ = (Qx + to)X 2 /2 - ftX + A£ £ ^ +0 



XI 



(31) 
Fur- 



ther, the following replacement of variables is convenient 
in handling our computation in the limit of /3 — > oo: 
Qd + qD -» /3Qd, to ->■ /? 2 Xd, 1 - to -> xd/A 
Qx + to -> AQx, to ->■ Axx, Qx - to -> xx/A 

and A — > /3A. In /3 — > oo, integral with respect to X in 



eq. ([301), / dXe~^, is replaced to e -P4>(.h;Q x ,X) by applying 
the saddle point method. Inserting eqs. (|2"9"|) - (|3"Tj) and the 
rescaled variables into eq. ((24)) offers the expression of the 
zero temperature free energy density (|3|). 
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